of PCA enables identification of the principal directions in which the data vary. In essence, the principal components, or normal modes, of a general dataset are equivalent to axes in multidimensional space parallel to which there is most variation in the data. There can be several principal components, depending on the number of independent/orthogonal features of the data. In terms of images, this in effect represents an efficient method of data compression, reducing the key information to typically just a few tens of principal components that encapsulate the key features of an image.
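To make this concrete, the brief sketch below (an illustrative example, not taken from the text; the data and names are hypothetical) finds the principal directions of a synthetic two-dimensional point cloud, where the eigenvector of the covariance matrix with the largest eigenvalue points along the axis of greatest variation:

```python
# Minimal sketch: principal directions of a synthetic 2D point cloud.
import numpy as np

rng = np.random.default_rng(0)
# Toy data with most of its variation along one diagonal axis.
points = rng.normal(size=(500, 2)) @ np.diag([3.0, 0.5])
angle = 0.5
points = points @ np.array([[np.cos(angle), -np.sin(angle)],
                            [np.sin(angle),  np.cos(angle)]])

centred = points - points.mean(axis=0)
cov = np.cov(centred, rowvar=False)        # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
print("principal directions (columns):\n", eigvecs[:, order])
print("variance along each direction:", eigvals[order])
```

The same recipe carries over unchanged to image data, where each pixel intensity supplies one coordinate rather than two.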
In computational terms, the principal components of an image are found by taking a stack of similar images containing the same features and then calculating the eigenvectors and eigenvalues of the equivalent data covariance matrix for the image stack. The eigenvector that has the largest eigenvalue is equivalent to the direction of the greatest variation, the eigenvector with the second largest eigenvalue is an orthogonal direction that has the next highest variation, and so forth for higher orders of variation beyond this. These eigenvectors can then be summed to generate a compressed version of any given image, such that if more eigenvectors are included, then the compression is lower and the quality of the compressed image is better.
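A minimal numpy sketch of this procedure is given below, assuming a pre-aligned stack of same-sized images; the function name, the number of components retained, and the use of the smaller Gram matrix are assumptions of this sketch rather than prescriptions from the text:

```python
# Minimal sketch: eigenimages of an aligned image stack via PCA.
import numpy as np

def eigenimages(stack, n_components=10):
    """Return the mean image, the top eigenimages, and their eigenvalues."""
    n, h, w = stack.shape                        # stack: (n_images, height, width)
    flat = stack.reshape(n, h * w).astype(float)
    mean = flat.mean(axis=0)
    centred = flat - mean
    # When n << h*w it is cheaper to diagonalise the small n x n Gram matrix
    # and map its eigenvectors back into pixel space than to form the full
    # (h*w) x (h*w) pixel covariance matrix directly; the nonzero eigenvalues
    # are the same either way.
    gram = centred @ centred.T / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(gram)      # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = centred.T @ eigvecs[:, order]   # (h*w, n_components)
    components /= np.linalg.norm(components, axis=0)
    return mean.reshape(h, w), components.T.reshape(-1, h, w), eigvals[order]
```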
Each image of area i × j pixels may be depicted as a vector in (i × j)-dimensional hyperspace, whose coordinates are defined by the pixel intensity values. A stack of such images is thus equivalent to a cloud of points defined by the ends of these vectors in this hyperspace, such that images that share similar features will correspond to points in the cloud that are close to each other. A pixel-by-pixel comparison of all images, as is the case in maximum likelihood methods, is very slow and computationally costly. PCA instead can reduce the number of variables describing the stacked image data and find a minimum number of variables in the hyperspace which appear to be uncorrelated, that is, the principal components. Methods involving multivariate statistical analysis (MSA) can be used to identify the principal components in the hyperspace; in essence, these generate an estimate for the covariance matrix from a set of images based on pairwise comparisons of images and calculate the eigenvectors of the covariance matrix.
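Continuing the hypothetical eigenimages() helper sketched above, each image can then be assigned coordinates in the reduced hyperspace by projecting it onto the retained eigenvectors (again only a sketch under the same assumptions):

```python
# Minimal sketch: coordinates of each image in the reduced m-dimensional space.
import numpy as np

def project_stack(stack, mean_image, eigen_imgs):
    """Project every image onto the m eigenimages, giving (n_images, m) coordinates."""
    n, h, w = stack.shape
    centred = stack.reshape(n, h * w).astype(float) - mean_image.ravel()
    basis = eigen_imgs.reshape(len(eigen_imgs), h * w)   # (m, h*w)
    return centred @ basis.T                             # (n_images, m)
```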
These eigenvectors are a much smaller subset of the raw data and correspond to the regions of greatest variation in the image set. The images can then be depicted in a compressed form, as the linear sum of a given set of the identified eigenvectors, each referred to as an eigenimage, with relatively little loss of information, since the small variations that are discarded are in general due to noise.
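As a rough illustration using the same hypothetical helpers, a compressed image is then simply the mean image plus the weighted linear sum of the retained eigenimages:

```python
# Minimal sketch: reconstruct a compressed image from its PCA coordinates.
import numpy as np

def reconstruct(coords, mean_image, eigen_imgs):
    """Compressed image = mean image + weighted sum of the m eigenimages."""
    h, w = mean_image.shape
    basis = eigen_imgs.reshape(len(eigen_imgs), h * w)   # (m, h*w)
    return mean_image + (coords @ basis).reshape(h, w)
```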
PCA is particularly useful in objectively determining different image classes from a given set on the basis of the different combinations of weightings of the associated eigenvectors. These different image classes can often be detected as clusters in m-dimensional space, where m is the number of eigenvectors used to represent the key features of the image set. Several clustering algorithms are available to detect these, a common one being k-means, which partitions points in m-dimensional space into clusters by iteratively assigning each point to the cluster with the nearest mean (a minimal version is sketched after this paragraph). Such methods have been used, for example, to recognize different image classes in EM, and in cryo-EM images in particular (see Chapter 5), for molecular complexes trapped during the sample preparation process in different metastable conformational states. In thus being able to build up different image classes from a population of several images, averaging can be performed separately within each image class to generate often exquisite detail of molecular machines in different states. From simple Poisson sampling statistics, averaging across n such images in a given class reduces the noise on the averaged image by a factor of ~√n. By stitching such averages from different image classes together,
a movie can often be made to suggest actual dynamic movements involved in molecular
machines. This was most famously performed for the stepping motion of the muscle protein
myosin on F-actin filaments.
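The clustering and class-averaging steps described above might be sketched as follows; this plain numpy k-means is only a minimal stand-in for the more sophisticated classification schemes actually used on EM data:

```python
# Minimal sketch: k-means classification of images by their PCA coordinates,
# followed by averaging within each class to improve signal-to-noise (~sqrt(n)).
import numpy as np

def kmeans_classes(coords, k, n_iter=100, seed=0):
    """Assign each image (one row of PCA coordinates) to one of k classes."""
    rng = np.random.default_rng(seed)
    centres = coords[rng.choice(len(coords), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest cluster centre...
        labels = np.argmin(((coords[:, None, :] - centres) ** 2).sum(axis=-1), axis=1)
        # ...then move each centre to the mean of its members
        # (empty classes are not handled in this bare-bones sketch).
        centres = np.array([coords[labels == j].mean(axis=0) for j in range(k)])
    return labels

def class_averages(stack, labels, k):
    """Average the raw images within each class."""
    return np.array([stack[labels == j].mean(axis=0) for j in range(k)])
```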
Algorithms involving the wavelet transform (WT) are being increasingly developed to
tackle problems of denoising, image segmentation, and recognition. For the latter, WT is a
complementary technique to PCA. In essence, WT provides an alternative to the DFT but
decomposes an image into two separate spatial frequency components (high and low) along each image axis. These two components can then be combined to give four separate resulting image outputs.
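For illustration, one level of the simplest such decomposition, the 2D Haar wavelet, can be sketched as below (a hypothetical example assuming an image with even dimensions; the four outputs are the low-low, low-high, high-low, and high-high sub-band images):

```python
# Minimal sketch: one level of a 2D Haar wavelet transform.
import numpy as np

def haar_dwt2(img):
    """Split an image (even dimensions) into four sub-bands: LL, LH, HL, HH."""
    img = img.astype(float)
    # Low-pass (pairwise sum) and high-pass (pairwise difference) along rows...
    lo = (img[0::2, :] + img[1::2, :]) / np.sqrt(2)
    hi = (img[0::2, :] - img[1::2, :]) / np.sqrt(2)
    # ...then along columns, giving the four sub-band output images.
    ll = (lo[:, 0::2] + lo[:, 1::2]) / np.sqrt(2)
    lh = (lo[:, 0::2] - lo[:, 1::2]) / np.sqrt(2)
    hl = (hi[:, 0::2] + hi[:, 1::2]) / np.sqrt(2)
    hh = (hi[:, 0::2] - hi[:, 1::2]) / np.sqrt(2)
    return ll, lh, hl, hh
```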
The distribution of pixel intensities in each of these four WT output images represents a
compressed signature of the raw image data. Filtering can be performed by blocking different